Show the code
import pandas as pd
import numpy as np
from lets_plot import *
LetsPlot.setup_html(isolated_frame=True)import pandas as pd
import numpy as np
from lets_plot import *
LetsPlot.setup_html(isolated_frame=True)# Learn morea about Code Cells: https://quarto.org/docs/reference/cells/cells-jupyter.html
# Include and execute your code here
df = pd.read_json("https://github.com/byuidatascience/data4missing/raw/master/data-raw/flights_missing/flights_missing.json")A SHORT (2-3 SENTENCES) PARAGRAPH THAT DESCRIBES KEY INSIGHTS TAKEN FROM METRICS IN THE PROJECT RESULTS THINK TOP OR MOST IMPORTANT RESULTS. (Note: this is not a summary of the project, but a summary of the results.)
A Client has requested this analysis and this is your one shot of what you would say to your boss in a 2 min elevator ride before he takes your report and hands it to the client.
Fix all of the varied missing data types in the data to be consistent (all missing values should be displayed as “NaN”). In your report include one record example (one row) from your new data, in the raw JSON format. Your example should display the “NaN” for at least one missing value.__
Here is one example of an item that has a couple of null values
df.replace([-999, "", " "], np.nan, inplace=True)
# Convert to correct datatype
df = df.convert_dtypes()
row = df.iloc[0]
empty_values = row[row.isna()]
# Display in json
example_record = empty_values.to_json(indent=4)
print(example_record){
"num_of_delays_late_aircraft":null
}
Which airport has the worst delays? Describe the metric you chose, and why you chose it to determine the “worst” airport. Your answer should include a summary table that lists (for each airport) the total number of flights, total number of delayed flights, proportion of delayed flights, and average delay time in hours.
I chose the metric of average delay hours. This I feel like is usually what determines a bad airport
# Include and execute your code here
# Convert the delays columns to numeric (handle "1500+" as NaN for simplicity)
df["num_of_delays_carrier"] = pd.to_numeric(df["num_of_delays_carrier"], errors='coerce')
df["num_of_delays_late_aircraft"] = pd.to_numeric(df["num_of_delays_late_aircraft"], errors='coerce')
df["num_of_delays_nas"] = pd.to_numeric(df["num_of_delays_nas"], errors='coerce')
df["num_of_delays_security"] = pd.to_numeric(df["num_of_delays_security"], errors='coerce')
df["num_of_delays_weather"] = pd.to_numeric(df["num_of_delays_weather"], errors='coerce')
# Calculate the proportion of delayed flights for each airport
df["total_delays"] = df["num_of_delays_carrier"] + df["num_of_delays_late_aircraft"] + df["num_of_delays_nas"] + df["num_of_delays_security"] + df["num_of_delays_weather"]
df["proportion_delayed"] = df["total_delays"] / df["num_of_flights_total"]
# Calculate average delay time in hours
df["avg_delay_hours"] = df["minutes_delayed_total"] / 60 / df["num_of_flights_total"]
# Find the airport with the worst delays based on total proportion of delays
worst_airport = df.loc[df["proportion_delayed"].idxmax()]
# Display the summary table for each airport
summary_df = df[["airport_name", "num_of_flights_total", "total_delays", "proportion_delayed", "avg_delay_hours"]]
summary_df| airport_name | num_of_flights_total | total_delays | proportion_delayed | avg_delay_hours | |
|---|---|---|---|---|---|
| 0 | Atlanta, GA: Hartsfield-Jackson Atlanta Intern... | 35048 | <NA> | <NA> | 0.221379 |
| 1 | Denver, CO: Denver International | 12687 | 3148 | 0.248128 | 0.240137 |
| 2 | <NA> | 12381 | 2432 | 0.19643 | 0.18157 |
| 3 | Chicago, IL: Chicago O'Hare International | 28194 | 9178 | 0.32553 | 0.377677 |
| 4 | San Diego, CA: San Diego International | 7283 | 1953 | 0.268159 | 0.209511 |
| ... | ... | ... | ... | ... | ... |
| 919 | Washington, DC: Washington Dulles International | 2799 | 443 | 0.158271 | 0.185566 |
| 920 | Chicago, IL: Chicago O'Hare International | 25568 | 4233 | 0.165559 | 0.200743 |
| 921 | San Diego, CA: San Diego International | 6231 | 1384 | 0.222115 | 0.195632 |
| 922 | San Francisco, CA: San Francisco International | 13833 | 4465 | 0.322779 | 0.433244 |
| 923 | Salt Lake City, UT: Salt Lake City International | 8804 | 1744 | 0.198092 | 0.203069 |
924 rows × 5 columns
What is the best month to fly if you want to avoid delays of any length? Describe the metric you chose and why you chose it to calculate your answer. Include one chart to help support your answer, with the x-axis ordered by month. (To answer this question, you will need to remove any rows that are missing the Month variable.)
I chose the porportion of delays. So basically the amount of delays compared to other months. From this September is the best moth to travel
# Remove rows with missing "month"
df = df.dropna(subset=["month"])
# Aggregate delay statistics per month
month_delay_summary = df.groupby("month").agg({
"num_of_delays_total": "sum",
"num_of_flights_total": "sum"
}).reset_index()
# Calculate proportion of delays
month_delay_summary["delay_proportion"] = (
month_delay_summary["num_of_delays_total"] / month_delay_summary["num_of_flights_total"]
)
# Order months correctly
months_order = [
"January", "February", "March", "April", "May", "June",
"July", "August", "September", "October", "November", "December"
]
month_delay_summary["month"] = pd.Categorical(month_delay_summary["month"], categories=months_order, ordered=True)
# Plot
p = ggplot(month_delay_summary, aes(x="month", y="delay_proportion")) + geom_bar(stat="identity") + ggtitle("Proportion of Delays by Month")
p.show()According to the BTS website, the “Weather” category only accounts for severe weather delays. Mild weather delays are not counted in the “Weather” category, but are actually included in both the “NAS” and “Late-Arriving Aircraft” categories. Your job is to create a new column that calculates the total number of flights delayed by weather (both severe and mild). You will need to replace all the missing values in the Late Aircraft variable with the mean. Show your work by printing the first 5 rows of data in a table. Use these three rules for your calculations:
a. 100% of delayed flights in the Weather category are due to weather
a. 30% of all delayed flights in the Late-Arriving category are due to weather
a. From April to August, 40% of delayed flights in the NAS category are due to weather. The rest of the months, the proportion rises to 65%
This is pretty cool. Combining these numbers gives us a more accurate representation of what is going on.
# Include and execute your code here
# Replace missing values in Late Aircraft with the mean
df["num_of_delays_late_aircraft"] = df["num_of_delays_late_aircraft"].astype(float).fillna(df["num_of_delays_late_aircraft"].mean())
# Create a new column for total weather delays
df["weather_delays"] = df["num_of_delays_weather"]
# Add 30% of Late Aircraft delays as weather delays
df["weather_delays"] += 0.3 * df["num_of_delays_late_aircraft"]
# Add NAS delays as weather delays (40% from April to August, else 65%)
df["weather_delays"] += df.apply(
lambda row: 0.4 * row["num_of_delays_nas"] if row["month"] in ["April", "May", "June", "July", "August"]
else 0.65 * row["num_of_delays_nas"], axis=1
)
# Display the first 5 rows with the new "weather_delays" column
df.head()| airport_code | airport_name | month | year | num_of_flights_total | num_of_delays_carrier | num_of_delays_late_aircraft | num_of_delays_nas | num_of_delays_security | num_of_delays_weather | ... | minutes_delayed_carrier | minutes_delayed_late_aircraft | minutes_delayed_nas | minutes_delayed_security | minutes_delayed_weather | minutes_delayed_total | total_delays | proportion_delayed | avg_delay_hours | weather_delays | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ATL | Atlanta, GA: Hartsfield-Jackson Atlanta Intern... | January | 2005 | 35048 | <NA> | 1109.104072 | 4598 | 10 | 448 | ... | 116423 | 104415 | 207467 | 297 | 36931 | 465533 | <NA> | <NA> | 0.221379 | 3769.431222 |
| 1 | DEN | Denver, CO: Denver International | January | 2005 | 12687 | 1041 | 928.000000 | 935 | 11 | 233 | ... | 53537 | 70301 | 36817 | 363 | 21779 | 182797 | 3148 | 0.248128 | 0.240137 | 1119.15 |
| 2 | IAD | <NA> | January | 2005 | 12381 | 414 | 1058.000000 | 895 | 4 | 61 | ... | <NA> | 70919 | 35660 | 208 | 4497 | 134881 | 2432 | 0.19643 | 0.18157 | 960.15 |
| 3 | ORD | Chicago, IL: Chicago O'Hare International | January | 2005 | 28194 | 1197 | 2255.000000 | 5415 | 5 | 306 | ... | 88691 | 160811 | 364382 | 151 | 24859 | 638894 | 9178 | 0.32553 | 0.377677 | 4502.25 |
| 4 | SAN | San Diego, CA: San Diego International | January | 2005 | 7283 | 572 | 680.000000 | 638 | 7 | 56 | ... | 27436 | 38445 | 21127 | 218 | 4326 | 91552 | 1953 | 0.268159 | 0.209511 | 674.7 |
5 rows × 21 columns
Using the new weather variable calculated above, create a barplot showing the proportion of all flights that are delayed by weather at each airport. Describe what you learn from this graph.
San Francisco has more delays than the other airports. I believe this could be to the fog that always surronds this airport and its cool to see this represented in data.
# Fix missing values in `num_of_delays_late_aircraft`
df["num_of_delays_late_aircraft"] = pd.to_numeric(df["num_of_delays_late_aircraft"], errors="coerce")
df["num_of_delays_late_aircraft"].fillna(df["num_of_delays_late_aircraft"].mean(), inplace=True)
# Convert num_of_delays_nas to numeric (handling any errors)
df["num_of_delays_nas"] = pd.to_numeric(df["num_of_delays_nas"], errors="coerce")
# Convert `Month` to numeric and handle missing values
month_mapping = {
"January": 1, "February": 2, "March": 3, "April": 4, "May": 5, "June": 6,
"July": 7, "August": 8, "September": 9, "October": 10, "November": 11, "December": 12
}
df["Month"] = df["month"].map(month_mapping)
df.dropna(subset=["Month"], inplace=True) # Drop rows with missing months
# Compute total weather-related delays
df["total_weather_delays"] = (
df["num_of_delays_weather"] +
(0.30 * df["num_of_delays_late_aircraft"]) +
np.where(df["Month"].between(4, 8), 0.40 * df["num_of_delays_nas"], 0.65 * df["num_of_delays_nas"])
)
# Calculate the proportion of total delays caused by weather
df["weather_delay_proportion"] = df["total_weather_delays"] / df["num_of_delays_total"]
# Remove rows where 'weather_delay_proportion' is NaN or infinite (caused by 0 total delays)
df_clean = df.dropna(subset=["weather_delay_proportion"])
df_clean = df_clean[df_clean["weather_delay_proportion"] != np.inf]
# Create the bar plot
p = ggplot(df_clean, aes(x="airport_code", y="weather_delay_proportion")) + \
geom_bar(stat="identity") + \
ggtitle("Proportion of Weather Delays Compared to Total Delays by Airport")
p.show()Which delay is the worst delay? Create a similar analysis as above for Weahter Delay with: Carrier Delay and Security Delay. Compare the proportion of delay for each of the three categories in a Chart and a Table. Describe your results.
Its cool to see that weather is the largest factor in delays. Its quite incredibly that all the delays that are experienced are out of our control.
# Include and execute your code here
# Calculate proportion of delays by type
df["carrier_delay_proportion"] = df["num_of_delays_carrier"] / df["num_of_flights_total"]
df["weather_delay_proportion"] = df["total_weather_delays"] / df["num_of_flights_total"]
df["security_delay_proportion"] = df["num_of_delays_security"] / df["num_of_flights_total"]
# Create a summary table
delay_summary = df.groupby("airport_code").agg({
"carrier_delay_proportion": "mean",
"weather_delay_proportion": "mean",
"security_delay_proportion": "mean"
}).reset_index()
# Melt data for plotting
delay_summary_melted = delay_summary.melt(id_vars=["airport_code"], var_name="delay_type", value_name="proportion")
# Plot
p = ggplot(delay_summary_melted, aes(x="airport_code", y="proportion", fill="delay_type")) + \
geom_bar(stat="identity", position="dodge") + \
ggtitle("Comparison of Different Types of Delays by Airport")
p.show()